Skip to content

Add DTLS throughput benchmark tool and optimize send path#10551

Open
julek-wolfssl wants to merge 2 commits into
wolfSSL:masterfrom
julek-wolfssl:dtls-perf-benchmark
Open

Add DTLS throughput benchmark tool and optimize send path#10551
julek-wolfssl wants to merge 2 commits into
wolfSSL:masterfrom
julek-wolfssl:dtls-perf-benchmark

Conversation

@julek-wolfssl

Copy link
Copy Markdown
Member

Add examples/benchmark/dtls_bench, a DTLS throughput benchmark that completes a handshake and then measures bulk-send throughput. It supports DTLS 1.2 and 1.3, selectable cipher suites, an end-to-end mode, and a -z sink mode that discards records on the server after the handshake to isolate the sender's record-layer cost. The socket is set up with wolfSSL_set_dtls_fd_connected.

Optimize the send path exercised by the benchmark:

  • wolfio (EmbedSendTo): cache the per-descriptor socket-type probe (getsockopt SO_TYPE) in WOLFSSL_DTLS_CTX instead of running it on every send, removing a syscall from the record send path. The cache is invalidated whenever rfd/wfd is reassigned.

  • internal (BuildMessage): for AEAD suites whose explicit nonce is the 8-byte record sequence number, write the sequence number directly as nonce_explicit instead of drawing it from the RNG. This covers AES-GCM (RFC 5288 sec 3), AES-CCM (RFC 6655 sec 3), SM4-GCM/CCM (RFC 8998 sec 3), and Camellia-/ARIA-GCM which inherit the RFC 5288 construction; ChaCha20 uses an implicit nonce and is excluded. A new read-only PeekSEQ() helper reads the sequence number without advancing the per-direction counter, leaving the single mandated increment to writeAeadAuthData().

Also ignore the built dtls_bench binary in .gitignore.

Copilot AI review requested due to automatic review settings May 28, 2026 14:07

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds a new DTLS throughput benchmark under examples/benchmark/ and makes two optimizations in the DTLS send path to better measure (and reduce) per-record overhead in wolfSSL’s record layer and socket I/O glue.

Changes:

  • Add examples/benchmark/dtls_bench.c: a DTLS 1.2/1.3 throughput benchmark with cipher selection, plain-UDP baseline mode, and a client-side “sink send” mode.
  • Optimize DTLS send path by caching the SO_TYPE (datagram vs stream) probe in WOLFSSL_DTLS_CTX instead of calling getsockopt() on every send.
  • Optimize AEAD explicit-nonce construction by writing the record sequence number directly for suites where the explicit nonce is defined as the seq number, using a new read-only PeekSEQ() helper.

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
wolfssl/internal.h Adds DTLS context fields for caching socket type probe results.
tests/api.c Resets new DTLS context cache fields when copying SSL state in an API test helper.
src/wolfio.c Changes datagram-vs-stream detection to cache SO_TYPE results.
src/ssl.c Invalidates the DTLS socket-type cache when read/write fds are (re)assigned.
src/internal.c Adds PeekSEQ() and uses it to derive AEAD explicit nonce from sequence number for applicable suites.
examples/benchmark/include.am Adds dtls_bench to Automake build outputs.
examples/benchmark/dtls_bench.c New DTLS benchmark tool implementation.
.gitignore Ignores the built examples/benchmark/dtls_bench binary.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/wolfio.c
Comment thread examples/benchmark/include.am
Comment thread examples/benchmark/dtls_bench.c
Comment thread examples/benchmark/dtls_bench.c
@julek-wolfssl julek-wolfssl self-assigned this May 28, 2026
@julek-wolfssl julek-wolfssl force-pushed the dtls-perf-benchmark branch from 8d445f1 to 32c7f0b Compare May 28, 2026 17:02
@julek-wolfssl julek-wolfssl marked this pull request as ready for review May 28, 2026 17:19
@github-actions

Copy link
Copy Markdown

retest this please

@julek-wolfssl julek-wolfssl force-pushed the dtls-perf-benchmark branch 2 times, most recently from 7b5387d to 4068636 Compare May 29, 2026 16:27
@github-actions

github-actions Bot commented May 29, 2026

Copy link
Copy Markdown

MemBrowse Memory Report

gcc-arm-cortex-m3

  • FLASH: .text +104 B (+0.1%, 121,009 B / 262,144 B, total: 46% used)

gcc-arm-cortex-m4

  • FLASH: .text +128 B (+0.1%, 198,318 B / 262,144 B, total: 76% used)

gcc-arm-cortex-m4-dtls13

  • FLASH: .text +64 B (+0.0%, 179,352 B / 1,048,576 B, total: 17% used)

gcc-arm-cortex-m4-openssl-compat

  • FLASH: .text +64 B (+0.0%, 765,740 B / 1,048,576 B, total: 73% used)

gcc-arm-cortex-m4-rsa-only

  • FLASH: .text +64 B (+0.0%, 322,168 B / 1,048,576 B, total: 31% used)

gcc-arm-cortex-m4-tls12

  • FLASH: .text +64 B (+0.1%, 121,741 B / 262,144 B, total: 46% used)

gcc-arm-cortex-m7

@julek-wolfssl julek-wolfssl force-pushed the dtls-perf-benchmark branch from 4068636 to 0d93481 Compare May 31, 2026 14:02
@julek-wolfssl

julek-wolfssl commented May 31, 2026

Copy link
Copy Markdown
Member Author

retest this please.

@julek-wolfssl julek-wolfssl force-pushed the dtls-perf-benchmark branch from 0d93481 to 0c90533 Compare June 1, 2026 11:12
Add examples/benchmark/dtls_bench, a DTLS throughput benchmark that
completes a handshake and then measures bulk-send throughput. It
supports DTLS 1.2 and 1.3, selectable cipher suites, an end-to-end
mode, and a -z sink mode that discards records on the server after the
handshake to isolate the sender's record-layer cost. The socket is set
up with wolfSSL_set_dtls_fd_connected.

Optimize the send path exercised by the benchmark:

- wolfio (EmbedSendTo): cache the per-descriptor socket-type probe
  (getsockopt SO_TYPE) in WOLFSSL_DTLS_CTX instead of running it on
  every send, removing a syscall from the record send path. The cache
  is invalidated whenever rfd/wfd is reassigned.

- internal (BuildMessage): for AEAD suites whose explicit nonce is the
  8-byte record sequence number, write the sequence number directly as
  nonce_explicit instead of drawing it from the RNG. This covers
  AES-GCM (RFC 5288 sec 3), AES-CCM (RFC 6655 sec 3), SM4-GCM/CCM
  (RFC 8998 sec 3), and Camellia-/ARIA-GCM which inherit the RFC 5288
  construction; ChaCha20 uses an implicit nonce and is excluded. A new
  read-only PeekSEQ() helper reads the sequence number without advancing
  the per-direction counter, leaving the single mandated increment to
  writeAeadAuthData().

Also ignore the built dtls_bench binary in .gitignore.
@julek-wolfssl

Copy link
Copy Markdown
Member Author

The jenkins failures are not related to this PR.

@douzzer douzzer added the Staged Staged for merge pending final test results and review label Jun 5, 2026
douzzer
douzzer previously requested changes Jun 5, 2026
Comment thread examples/benchmark/dtls_bench.c
@douzzer douzzer removed the Staged Staged for merge pending final test results and review label Jun 5, 2026
dtls_bench.c is built whenever DTLS and the example servers are enabled,
including the cross-mingw-all-crypto multi-test scenario, which cross-
compiles for Windows. It directly includes POSIX-only headers
(<sys/socket.h>, <arpa/inet.h>, <netdb.h>, <net/if.h>) that mingw does
not ship, so the build failed there.

Gate the networking includes and the whole benchmark body behind a
DTLS_BENCH_ENABLED check (WOLFSSL_DTLS, not USE_WINDOWS_API, not
WOLFSSL_NO_SOCK). When the platform lacks POSIX BSD sockets, compile a
small stub main() that reports the tool is unsupported, so the source
tree still builds.

@dgarske dgarske left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skoll Code Review

Scan type: reviewOverall recommendation: COMMENT
Findings: 5 total — 5 posted, 0 skipped
5 finding(s) posted as inline comments (see file-level comments below)

Posted findings

  • [Medium] isDGram cache write is an unsynchronized data race under WOLFSSL_RW_THREADEDsrc/wolfio.c:655-666
  • [Low] Single isDGram cache shared between rfd and wfd of potentially different socket typessrc/wolfio.c:649-667
  • [Low] Confirm discarded explicit-nonce value is intentional across FIPS and epoch-order pathssrc/internal.c:24809-24831
  • [Low] now_sec ignores clock_gettime failure, can return uninitialized timeexamples/benchmark/dtls_bench.c:96-101
  • [Low] -z (sink send) silently ignored when combined with -s (server)examples/benchmark/dtls_bench.c:parse_args

Review generated by Skoll

Comment thread src/wolfio.c
/* optvalue 'type' is of size int */
XSOCKLENT length = (XSOCKLENT)sizeof(type);

if (dtlsCtx->isDGramCached)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 [Medium] isDGram cache write is an unsynchronized data race under WOLFSSL_RW_THREADED

Previously isDGramSock() was a pure read of the socket type. It now lazily writes dtlsCtx->isDGram and dtlsCtx->isDGramCached (bit-fields that share a storage unit with connected, and with processingPendingRecord when WOLFSSL_DTLS_CID is set). In a WOLFSSL_RW_THREADED build, EmbedSendTo (write thread) calls this on every send and EmbedReceiveFrom (read thread) calls it on the recvd == 0 path, so two threads can perform the first-time write concurrently. Writing adjacent bit-fields is a read-modify-write of the shared byte and is not atomic, so this is a data race (UB per the C memory model) on dtlsCtx. In practice it is benign because both threads write the same socket-type value and connected is not modified concurrently, but it can be reported by ThreadSanitizer CI and is a new race introduced by moving shared-state mutation into the hot I/O path. The rest of dtlsCtx peer state is already protected by peerLock precisely because these callbacks run concurrently.

Fix: Confirm intent for threaded builds; prefer computing the cache eagerly at fd assignment to remove the race from the data path entirely.

Comment thread src/wolfio.c
* fixed for its lifetime, so the getsockopt(SO_TYPE) probe runs once per
* connection and is cached in dtlsCtx; the cache is invalidated wherever
* rfd/wfd is reassigned. */
static int isDGramSock(WOLFSSL_DTLS_CTX* dtlsCtx, int sfd)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 [Low] Single isDGram cache shared between rfd and wfd of potentially different socket types

The cache stores one isDGram flag per dtlsCtx, but it is consulted for both the read descriptor (dtlsCtx->rfd, via EmbedReceiveFrom) and the write descriptor (dtlsCtx->wfd, via EmbedSendTo). If an application sets rfd and wfd to different sockets of different types (legal via wolfSSL_set_read_fd/wolfSSL_set_write_fd), the first probe wins and the cached value is then incorrectly applied to the other descriptor. This is unusual for DTLS (rfd == wfd normally) and the cache is invalidated on each fd reassignment, so the common path is fine, but the original code re-probed the exact sfd passed in each time, whereas the cache is now keyed on the context rather than the descriptor.

Fix: Confirm the rfd==wfd (or same-type) assumption holds for all supported DTLS usages; document it next to the cache fields.

Comment thread src/internal.c
ret = wc_RNG_GenerateBlock(ssl->rng, args->iv, args->ivSz);
if (ret != 0)
goto exit_buildmsg;
#ifdef HAVE_AEAD

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 [Low] Confirm discarded explicit-nonce value is intentional across FIPS and epoch-order paths

The new code writes the record sequence number into the explicit nonce via PeekSEQ(ssl, epochOrder, args->iv) instead of wc_RNG_GenerateBlock. In every standard configuration this written value is subsequently overwritten before transmission: the non-FIPS/FIPSv2+ path copies the cipher-generated IV over it (XMEMCPY(out, ssl->encrypt.nonce + AESGCM_IMP_IV_SZ, ...) at line ~20741), and the legacy-FIPS path overwrites args->iv with ssl->keys.aead_exp_IV (line ~24839). So the real effect of this change is removing the per-record RNG draw; the seq value itself is intentionally discarded. Note also that the actual AAD seq is written with CUR_ORDER inside writeAeadAuthData (line ~20695) while PeekSEQ here uses epochOrder, so the two would differ for DTLS SCR PREV_ORDER retransmits, but this is harmless precisely because the PeekSEQ-written bytes are discarded. This is correct but subtle; a future reader could mistakenly assume the on-wire explicit nonce equals PeekSEQ.

Fix: Keep the change; add a short clarifying comment to prevent future misreading and confirm the RNG-removal is the intended win.

#define DEFAULT_SOCKBUF (8 * 1024 * 1024)

/* Upper bound on per-record overhead the user must leave under MTU.
* DTLS 1.2 worst case: 13 (hdr) + 8 (explicit nonce) + 16 (AEAD tag) = 37.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 [Low] now_sec ignores clock_gettime failure, can return uninitialized time

clock_gettime()'s return value is cast to void and ts is not initialized, so on the (rare) failure of CLOCK_MONOTONIC the function returns garbage from the stack, corrupting all throughput/duration math. This is benchmark-only code so impact is limited, but it is trivial to harden.

Fix: Initialize ts and check the return value, returning a sentinel on failure.

break;
case 'b': c->sockBuf = atoi(optarg); break;
case 'n': c->plainUdp = 1; break;
case 'z': c->sinkSend = 1; break;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 [Low] -z (sink send) silently ignored when combined with -s (server)

The help text documents -z as client-only, but parse_args performs no validation when -z is passed together with -s. In server mode sinkSend is parsed and then never used, so a user invoking -s -z gets no warning that the flag had no effect. Minor usability nit in example code.

Fix: Emit a warning (or error) when -z is combined with -s so the no-op is visible to the user.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants